Auto-Join: Joining Tables by Leveraging Transformations

Authors

  • Erkang Zhu
  • Yeye He
  • Surajit Chaudhuri
Abstract

Traditional equi-join relies solely on string equality comparisons to perform joins. However, in scenarios such as ad hoc data analysis in spreadsheets, users increasingly need to join tables whose join columns are from the same semantic domain but use different textual representations, for which transformations are needed before an equi-join can be performed. We developed Auto-Join, a system that can automatically search over a rich space of operators to compose a transformation program whose execution makes input tables equi-join-able. We developed an optimal sampling strategy that allows Auto-Join to scale to large datasets efficiently, while ensuring joins succeed with high probability. Our evaluation using real test cases collected from both public web tables and proprietary enterprise tables shows that the proposed system performs the desired transformation joins efficiently and with high quality.
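
The idea of searching an operator space for a program that makes two columns equi-join-able can be illustrated with a small sketch. This is not the Auto-Join algorithm itself, whose operators, search procedure, and sampling strategy are not described in the abstract. It is a brute-force toy under stated assumptions: the operator set (split-and-pick, lowercase), the scoring rule (count of source keys that land on a target key), and all function names are hypothetical and chosen only for illustration.

```python
from itertools import product


def make_split_pick(delim, index):
    """Hypothetical operator: split on `delim`, keep the field at `index`."""
    def op(s):
        parts = s.split(delim)
        if -len(parts) <= index < len(parts):
            return parts[index].strip()
        return None  # the requested field does not exist for this value
    return op


def identity(s):
    return s


def lower(s):
    return s.lower()


def candidate_programs():
    """Enumerate two-step programs: a field extractor, then a case normalizer."""
    extractors = [identity] + [
        make_split_pick(d, i) for d, i in product([",", " ", "@"], [0, 1, -1])
    ]
    normalizers = [identity, lower]
    return list(product(extractors, normalizers))


def run(program, value):
    """Apply each operator in order; stop early if an operator fails."""
    for op in program:
        if value is None:
            return None
        value = op(value)
    return value


def find_program(source_keys, target_keys):
    """Pick the program that maps the most source keys onto some target key."""
    targets = set(target_keys)
    best, best_hits = None, -1
    for program in candidate_programs():
        hits = sum(1 for s in source_keys if run(program, s) in targets)
        if hits > best_hits:  # ties keep the earliest enumerated program
            best, best_hits = program, hits
    return best, best_hits


def transform_join(source_keys, target_keys):
    """Transform the source keys with the best program, then equi-join."""
    program, _ = find_program(source_keys, target_keys)
    targets = set(target_keys)
    pairs = []
    for s in source_keys:
        t = run(program, s)
        if t in targets:
            pairs.append((s, t))
    return pairs


source = ["Obama, Barack", "Bush, George", "Clinton, Bill"]
target = ["barack", "george", "bill"]
joined = transform_join(source, target)
```

In this toy input, no source key equals a target key directly, so a plain equi-join returns nothing; the search finds a program (take the field after the comma, then lowercase) under which all three rows join. A real system would need a far richer operator space and a way to avoid scoring every program against every row, which is where the sampling strategy mentioned in the abstract would come in.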

Similar resources

Hash-based Symmetric Data Structure and Join Algorithm for OLAP Applications

Star schema is often used in dimensional approaches applied to OLAP applications. The fact table in the star schema typically contains a huge amount of data. When some of the dimension tables are also very large, it may take too much time and storage to join the fact table with these dimension tables. The performance of the join algorithm becomes critical under such a condition. The fluent join is a ...

Neighbor Table Construction and Update in a Dynamic Peer-to-Peer Network

In a system proposed by Plaxton, Rajaraman and Richa (PRR), the expected cost of accessing a replicated object was proved to be asymptotically optimal for a static set of nodes and pre-existence of consistent and optimal neighbor tables in nodes [9]. To implement PRR’s hypercube routing scheme in a dynamic, distributed environment, such as the Internet, various protocols are needed (for node jo...

AdaptDB: Adaptive Partitioning for Distributed Joins

Big data analytics often involves complex join queries over two or more tables. Such join processing is expensive in a distributed setting both because large amounts of data must be read from disk, and because of data shuffling across the network. Many techniques based on data partitioning have been proposed to reduce the amount of data that must be accessed, often focusing on finding the best ...

A Visual Introduction to PROC SQL Joins.PDF

Real systems rarely store all their data in one large table. To do so would require maintaining several duplicate copies of the same values and could threaten the integrity of the data. Instead, IT departments everywhere almost always divide their data among several different tables. Because of this, a method is needed to simultaneously access two or more tables to help answer the interesting q...

SEMA-JOIN: Joining Semantically-Related Tables Using Big Table Corpora

Join is a powerful operator that combines records from two or more tables, and it is of fundamental importance in the field of relational databases. However, traditional join processing mostly relies on string equality comparisons. Given the growing demand for ad hoc data analysis, we have seen an increasing number of scenarios where the desired join relationship is not equi-join. For example, in ...

Journal:
  • PVLDB

Volume 10

Publication date: 2017